Optimize 1x1 convolution for Network-in-Network style operation #1118
Conversation
// Special case: im2col is the identity for 1x1 convolution w/ stride 1,
// so flag for skipping the buffer and transformation.
is_1x1_ = kernel_w_ == 1 && kernel_h_ == 1
    && stride_h_ == 1 && stride_w_ == 1;
We also need to check that there is zero padding, yes?
1x1 convolution with stride 1 is a special case of Caffe matrix multiplication convolution for which im2col / col2im transformations are actually the identity. For this special case the memory and transformation are skipped.
Sorry, I was thinking of the 3x3 case. No need for padding.

Sergio
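To make the special case concrete, below is a small standalone check that the column buffer comes out element-for-element equal to the input when the kernel is 1x1 with stride 1 and zero padding. The im2col here is a simplified re-implementation of the usual (channels * kernel_h * kernel_w) x (out_h * out_w) row-major layout, written only for illustration; it is not Caffe's own code.

```cpp
// Standalone sanity check: with a 1x1 kernel, stride 1, and zero padding,
// im2col writes out exactly the input, in the same order.
#include <cassert>
#include <vector>

void im2col(const std::vector<float>& im, int channels, int height, int width,
            int kernel_h, int kernel_w, int pad_h, int pad_w,
            int stride_h, int stride_w, std::vector<float>& col) {
  const int out_h = (height + 2 * pad_h - kernel_h) / stride_h + 1;
  const int out_w = (width + 2 * pad_w - kernel_w) / stride_w + 1;
  const int channels_col = channels * kernel_h * kernel_w;
  col.assign(static_cast<size_t>(channels_col) * out_h * out_w, 0.f);
  for (int c = 0; c < channels_col; ++c) {
    const int w_off = c % kernel_w;
    const int h_off = (c / kernel_w) % kernel_h;
    const int c_im = c / kernel_w / kernel_h;
    for (int h = 0; h < out_h; ++h) {
      for (int w = 0; w < out_w; ++w) {
        const int h_im = h * stride_h - pad_h + h_off;
        const int w_im = w * stride_w - pad_w + w_off;
        if (h_im >= 0 && h_im < height && w_im >= 0 && w_im < width)
          col[(c * out_h + h) * out_w + w] =
              im[(c_im * height + h_im) * width + w_im];
      }
    }
  }
}

int main() {
  const int channels = 3, height = 4, width = 5;
  std::vector<float> im(channels * height * width);
  for (size_t i = 0; i < im.size(); ++i) im[i] = static_cast<float>(i);

  std::vector<float> col;
  im2col(im, channels, height, width, /*kernel*/ 1, 1, /*pad*/ 0, 0,
         /*stride*/ 1, 1, col);
  assert(col == im);  // 1x1, stride 1, no padding: the buffer is the input

  // With padding the buffer grows and gains a zero border, so it is no longer
  // the identity -- hence the zero-padding requirement in the final description.
  im2col(im, channels, height, width, 1, 1, /*pad*/ 1, 1, 1, 1, col);
  assert(col != im);
  return 0;
}
```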
Dtype* col_diff = NULL;
if (!is_1x1_) {
  col_data = col_buffer_.mutable_cpu_data();
  col_diff = col_buffer_.mutable_cpu_diff();
By the way... could we save memory in the usual case by changing this line to col_buffer_.mutable_cpu_data()
(i.e., by reusing the same buffer for both data and diff)? Perhaps I have missed something, but I don't see any reason in the code below why we need two separate buffers...
Good catch -- there's no need for the two at once, since col_data is only for the gradient w.r.t. the weight while col_diff is only for the gradient w.r.t. the bottom. Should we parallelize these in the future, separate buffers will be needed, but that can be adjusted when we cross that bridge. Check out the follow-up commit.
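Since the commit message below notes that consolidating the two "saves a lazy allocation", here is a toy sketch of why: Caffe-style blobs materialize their data and diff arrays only on first use, so a backward pass that routes everything through the data half never pays for the diff half. The LazyBlob class is a simplified stand-in for illustration, not Caffe's Blob / SyncedMemory.

```cpp
// Toy lazily allocating blob: data and diff are only allocated on first use.
#include <cstdio>
#include <vector>

class LazyBlob {
 public:
  explicit LazyBlob(size_t count) : count_(count) {}
  float* mutable_data() {               // allocated on first request
    if (data_.empty()) data_.resize(count_);
    return data_.data();
  }
  float* mutable_diff() {               // allocated on first request
    if (diff_.empty()) diff_.resize(count_);
    return diff_.data();
  }
  size_t bytes_allocated() const {
    return (data_.size() + diff_.size()) * sizeof(float);
  }

 private:
  size_t count_;
  std::vector<float> data_, diff_;
};

int main() {
  const size_t col_count = 1 << 20;  // stand-in for an im2col buffer size

  LazyBlob separate(col_count);      // original: data for the weight gradient,
  separate.mutable_data();           // diff for the bottom gradient
  separate.mutable_diff();

  LazyBlob shared(col_count);        // consolidated: both stages reuse the
  shared.mutable_data();             // data half, one after the other
  shared.mutable_data();

  std::printf("separate buffers: %zu bytes, shared buffer: %zu bytes\n",
              separate.bytes_allocated(), shared.bytes_allocated());
  return 0;
}
```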
Looks pretty good. It might be worth a comment near the
This reminds me that we should recover the shared_col_buffer across layers. Any suggestions about which class should be responsible for providing them?

Sergio
@sguada seems to me that Net should broker shared blobs as requested; then each layer can reshape them on-the-fly as needed. The memory is shared across layers but still owned by the Net and will be freed along with the Net. @longjon's PR lets the blobs grow to the largest size needed. Could be worth a try for fully-convolutional models in the regime where Caffe's matrix multiplication is faster than cuDNN (at present).
conv forward / backward only need one of the im2col data and diff at a time, so consolidating the two saves a lazy allocation.
Optimize 1x1 convolution for Network-in-Network style operation
Awesome, this looks perfect to me. Thanks @shelhamer for writing this nice tight optimization (and being super responsive!)
@shelhamer I think we could do the same trick in the case that the filters have the same size as the bottoms, and no padding, so no stride would be needed and only one matrix multiplication is needed. Useful to replace fully connected layers with convolutions.
@sguada yes, the fully-connected case (bottom dimensions = filter dimensions) could be special-cased the same way. The other optimization is to allow batched im2col / col2im when memory allows.
@sguada @shelhamer for the column buffer to be identical to the input, we must have

(pad_h == 0 && pad_w == 0)
&& ((stride_w == kernel_w && width % kernel_w == 0 && kernel_h == 1)
|| (width == kernel_w && ((stride_h == kernel_h && height % kernel_h == 0)
|| height == kernel_h)))

which is rather more general than both the special cases discussed so far. Re: batched buffers: you would also get this for free in the above case. I wonder how much of a difference it makes, though?
@longjon I think there are some symmetric cases missing in that formula, i.e.:

How about this formula:
@sguada No, it's trickier than that, and not symmetric in the way you are thinking, because row-major order goes left-to-right, up-to-down. You might think that you could get the same optimization for the column-major contiguous cases by changing the transposition parameters, but I think that cannot be done because of the channel dimension.
@longjon yeah, you are right, I forgot to consider the asymmetry introduced by the row-major order.
@sguada I believe your expression is correct, but I think it's less clear to include redundant cases. One way or another, I think the expression should be explained by comments (although one really has to have the right picture to understand it). E.g., I would suggest writing:

// For the column buffer to be identical to the input, we must have...
// zero padding, plus...
(pad_h == 0 && pad_w == 0) &&
// the kernel must tile the input horizontally, and have height one...
((stride_w == kernel_w && width % kernel_w == 0 && kernel_h == 1)
// unless it takes the whole width of the input, in which case it must
// tile the input vertically, or take the whole height of the input!
|| (width == kernel_w && ((stride_h == kernel_h && height % kernel_h == 0)
|| height == kernel_h)))
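Conditions like this are easy to get subtly wrong, so brute-forcing small shapes is a handy sanity check. The sketch below (an illustration, not Caffe code) enumerates small zero-padding configurations and prints those for which the column buffer, in the usual Caffe layout of (channels * kernel_h * kernel_w) rows by (out_h * out_w) columns stored row-major, is element-for-element identical to the input; its output can be compared against expressions like the one above.

```cpp
// Brute-force search over small shapes for the cases where the column buffer
// equals the input: every read must stay inside the image (no padding) and
// the flat column index must match the flat image index for every element.
#include <cstdio>

bool is_identity(int C, int H, int W, int KH, int KW,
                 int PH, int PW, int SH, int SW) {
  const int OH = (H + 2 * PH - KH) / SH + 1;
  const int OW = (W + 2 * PW - KW) / SW + 1;
  if (OH < 1 || OW < 1) return false;
  if (C * KH * KW * OH * OW != C * H * W) return false;  // sizes must match
  for (int c = 0; c < C; ++c)
    for (int kh = 0; kh < KH; ++kh)
      for (int kw = 0; kw < KW; ++kw)
        for (int oh = 0; oh < OH; ++oh)
          for (int ow = 0; ow < OW; ++ow) {
            const int h_im = oh * SH - PH + kh;
            const int w_im = ow * SW - PW + kw;
            // a read outside the image would be a padding zero
            if (h_im < 0 || h_im >= H || w_im < 0 || w_im >= W) return false;
            const long col_idx =
                ((((long)c * KH + kh) * KW + kw) * OH + oh) * OW + ow;
            const long im_idx = ((long)c * H + h_im) * W + w_im;
            if (col_idx != im_idx) return false;
          }
  return true;
}

int main() {
  // Sweep small shapes with zero padding and report every identity case.
  const int C = 2;
  for (int H = 1; H <= 4; ++H)
    for (int W = 1; W <= 4; ++W)
      for (int KH = 1; KH <= H; ++KH)
        for (int KW = 1; KW <= W; ++KW)
          for (int SH = 1; SH <= 2; ++SH)
            for (int SW = 1; SW <= 2; ++SW)
              if (is_identity(C, H, W, KH, KW, /*pad*/ 0, 0, SH, SW))
                std::printf("identity: %dx%d input, %dx%d kernel, "
                            "stride %dx%d\n", H, W, KH, KW, SH, SW);
  return 0;
}
```

The same sweep covers the 1x1 case and the whole-input (fully-connected style) case discussed earlier.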
Optimize 1x1 convolution for Network-in-Network style operation
1x1 convolution with stride 1 and no padding is a special case of Caffe matrix multiplication convolution for which im2col / col2im transformations are actually the identity. For this special case the memory and transformation are skipped.
This optimizes the execution of 1x1 convolutions, i.e. NIN / CCCP convolutions.
@mavenlin